40 research outputs found

    Diffusion Component Analysis: Unraveling Functional Topology in Biological Networks

    Complex biological systems have been successfully modeled by biochemical and genetic interaction networks, typically gathered from high-throughput (HTP) data. These networks can be used to infer functional relationships between genes or proteins. Using the intuition that the topological role of a gene in a network relates to its biological function, local or diffusion-based "guilt-by-association" and graph-theoretic methods have had success in inferring gene functions. Here we seek to improve function prediction by integrating diffusion-based methods with a novel dimensionality reduction technique to overcome the incomplete and noisy nature of network data. In this paper, we introduce diffusion component analysis (DCA), a framework that plugs in a diffusion model and learns a low-dimensional vector representation of each node to encode the topological properties of a network. As a proof of concept, we demonstrate DCA's substantial improvement over state-of-the-art diffusion-based approaches in predicting protein function from molecular interaction networks. Moreover, our DCA framework can integrate multiple networks from heterogeneous sources, consisting of genomic information, biochemical experiments and other resources, to even further improve function prediction. Yet another layer of performance gain is achieved by integrating the DCA framework with support vector machines that take our node vector representations as features. Overall, our DCA framework provides a novel representation of nodes in a network that can be used as a plug-in architecture to other machine learning algorithms to decipher topological properties of and obtain novel insights into interactomes. Comment: RECOMB 201

    Compact Integration of Multi-Network Topology for Functional Analysis of Genes

    The topological landscape of molecular or functional interaction networks provides a rich source of information for inferring functional patterns of genes or proteins. However, a pressing yet-unsolved challenge is how to combine multiple heterogeneous networks, each having different connectivity patterns, to achieve more accurate inference. Here, we describe the Mashup framework for scalable and robust network integration. In Mashup, the diffusion in each network is first analyzed to characterize the topological context of each node. Next, the high-dimensional topological patterns in individual networks are canonically represented using low-dimensional vectors, one per gene or protein. These vectors can then be plugged into off-the-shelf machine learning methods to derive functional insights about genes or proteins. We present tools based on Mashup that achieve state-of-the-art performance in three diverse functional inference tasks: protein function prediction, gene ontology reconstruction, and genetic interaction prediction. Mashup enables deeper insights into the structure of rapidly accumulating and diverse biological network data and can be broadly applied to other network science domains. Keywords: interactome analysis; network integration; heterogeneous networks; dimensionality reduction; network diffusion; gene function prediction; genetic interaction prediction; gene ontology reconstruction; drug response prediction. National Institutes of Health (U.S.) (Grant R01GM081871

    Diffusion Component Analysis: Unraveling Functional Topology in Biological Networks

    Complex biological systems have been successfully modeled by biochemical and genetic interaction networks, typically gathered from high-throughput (HTP) data. These networks can be used to infer functional relationships between genes or proteins. Using the intuition that the topological role of a gene in a network relates to its biological function, local or diffusion-based “guilt-by-association” and graph-theoretic methods have had success in inferring gene functions [1, 2, 3]. Here we seek to improve function prediction by integrating diffusion-based methods with a novel dimensionality reduction technique to overcome the incomplete and noisy nature of network data. A type of diffusion algorithm, also known as random walk with restart (RWR), has been extensively studied in the context of biological networks and effectively applied to protein function prediction (e.g., [1]). The key idea is to propagate information along the network, in order to exploit both direct and indirect linkages between genes. Typically, a distribution of topological similarity is computed for each gene, in relation to other genes in the network, so that researchers can select the most related genes in the resulting distribution or, rather, select genes that share the most similar distributions. Though successful, these approaches are susceptible to noise in the input networks due to the high dimensionality of the computed distributions
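
    The random walk with restart described above can be illustrated with a minimal sketch. It assumes an undirected, unweighted adjacency matrix with no isolated nodes; the function name, the `restart_prob` value, and the stopping rule are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def random_walk_with_restart(adjacency, restart_prob=0.5, tol=1e-10, max_iter=1000):
    """Return a matrix whose row i is the stationary distribution of a
    random walk that restarts at node i with probability restart_prob."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    degree = A.sum(axis=1, keepdims=True)
    if np.any(degree == 0):
        # Assumption of this sketch: every node has at least one edge.
        raise ValueError("isolated nodes are not handled in this sketch")
    P = A / degree            # row-stochastic transition matrix
    restart = np.eye(n)       # row i restarts at node i
    S = restart.copy()
    for _ in range(max_iter):
        # One diffusion step for all start nodes at once:
        # s_i <- (1 - p) * s_i P + p * e_i
        S_next = (1.0 - restart_prob) * (S @ P) + restart_prob * restart
        if np.abs(S_next - S).max() < tol:
            return S_next
        S = S_next
    return S

# Hypothetical 4-node path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
S = random_walk_with_restart(A)
```

    Each row of `S` is a probability distribution over all nodes, so two genes can be compared either by reading off the largest entries in a row or by comparing whole rows, which corresponds to the two selection strategies mentioned in the abstract.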

    Embedding Node Structural Role Identity into Hyperbolic Space

    Recently, there has been an interest in embedding networks in hyperbolic space, since hyperbolic space has been shown to work well in capturing graph/network structure as it can naturally reflect some properties of complex networks. However, the work on network embedding in hyperbolic space has been focused on microscopic node embedding. In this work, we are the first to present a framework to embed the structural roles of nodes into hyperbolic space. Our framework extends struct2vec, a well-known structural role preserving embedding method, by moving it to a hyperboloid model. We evaluated our method on four real-world and one synthetic network. Our results show that hyperbolic space is more effective than Euclidean space in learning latent representations for the structural role of nodes. Comment: In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20), October 19-23, 2020, Virtual Event, Irelan
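
    For readers unfamiliar with the hyperboloid model mentioned above, the following sketch shows how distances are commonly computed in it. It assumes the standard unit-curvature hyperboloid (Lorentz) model; the helper names are hypothetical and the code is not taken from the paper's implementation.

```python
import numpy as np

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the point (sqrt(1 + |v|^2), v),
    which satisfies <x, x>_L = -1 with a positive first coordinate."""
    v = np.asarray(v, dtype=float)
    x0 = np.sqrt(1.0 + np.dot(v, v))
    return np.concatenate(([x0], v))

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def hyperboloid_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L) between two hyperboloid points."""
    # Clamp at 1 to guard against rounding pushing the argument below 1.
    return np.arccosh(np.maximum(1.0, -lorentz_inner(x, y)))

# Two hypothetical points on the one-dimensional hyperboloid.
a = lift_to_hyperboloid([np.sinh(1.0)])
b = lift_to_hyperboloid([np.sinh(2.5)])
d = hyperboloid_distance(a, b)
```

    In one dimension the lifted points are `(cosh t, sinh t)`, so the distance between the points at `t = 1.0` and `t = 2.5` is the difference of the two parameters.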

    Biomedical data sharing and analysis at scale : privacy, compaction, and integration

    Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019. By Hyunghoon Cho. Cataloged from PDF version of thesis. Page 307 blank. Includes bibliographical references (pages 279-306). Recent advances in high-throughput experimental technologies have led to the exponential growth of biomedical datasets, including personal genomes, single-cell sequencing experiments, and molecular interaction networks. The unprecedented scale, variety, and distributed ownership of emerging biomedical datasets present key computational challenges for sharing and analyzing these data to uncover new scientific insights. This thesis introduces a range of computational methods that overcome these challenges to enable scalable sharing and analysis of massive datasets in a range of biomedical domains. First, we introduce scalable privacy-preserving analysis pipelines built upon modern cryptographic tools to enable large amounts of sensitive biomedical data to be securely pooled from multiple entities for collaborative science. Second, we introduce efficient computational techniques for analyzing emerging large-scale sequencing datasets of millions of cells that leverage a compact summary of the data to speed up various analysis tasks while maintaining the accuracy of results. Third, we introduce integrative approaches to analyzing a growing variety of molecular interaction networks from heterogeneous data sources to facilitate functional characterization of poorly understood genes. The computational techniques we introduce for scaling essential biomedical analysis tasks to the large volume of data being generated are broadly applicable to other data science domains.

    Emerging technologies towards enhancing privacy in genomic data sharing

    As the scale of genomic and health-related data explodes and our understanding of these data matures, the privacy of the individuals behind the data is increasingly at stake. Traditional approaches to protect privacy have fundamental limitations. Here we discuss emerging privacy-enhancing technologies that can enable broader data sharing and collaboration in genomics research. NIH (Grant R01GM01108348